Predictability of Kullback-Leibler (KL) Divergence#
Introduction#
Streamflow prediction in ungauged basins (PUB) has classically focused on minimizing one or more loss functions, typically squared error, NSE, or KGE, between observed daily streamflow and predictions generated by some type of model, whether an empirical regionalization, a statistical machine learning model, or a process-based rainfall-runoff model. Regionalization and machine learning models depend upon an existing streamflow monitoring network, and the performance of these models is linked to the existing network’s representativeness of the ungauged space. This link raises two questions: how does varying the network arrangement affect the performance of PUB models overall, in terms of the expectation of prediction error across all ungauged locations, and, by extension, do there exist environmental signals orthogonal to streamflow with sufficient information to discriminate between network arrangements so as to minimize the expected prediction error over the ungauged space?
A simple interpretation of the loss functions commonly used in the PUB literature might be: “how close are mean daily streamflow predictions to observed values?” A much simpler question to ask of observational data is: “will a given model outperform random guessing in the long run?” This binary question represents a starting point for the optimal streamflow monitoring network problem. The justification for asking such a basic question is that an expectation of the uncertainty reduction over the unmonitored space provided by a given monitoring arrangement supports a discriminant function for comparing unique arrangements. In addition, more complex questions quickly become intractable because the number of comparisons grows exponentially with the size of the decision space (the set of possible monitoring locations). A simple, informative problem formulation can then be tested on real data, in this case an ungauged space of over 1 million catchments and a set of over 1600 monitored catchments with which to train a model.
Problem Formulation#
The question “is a given model better than random guessing in the long run” is formulated into a binary classification problem as follows:
The streamflow prediction model assumes that discharge at one location is equal to that at a different location on a unit area basis, commonly referred to as an equal unit area runoff (UAR) model.
Given the equal UAR model, an observed (proxy) catchment is a potential model for predicting the distribution of UAR for any unobserved (target) location,
A proxy is “informative” to a target if the proxy UAR is closer to the posterior target UAR than the maximum uncertainty (uniform distribution) prior.
A proxy is “disinformative” to a target if the maximum uncertainty (uniform distribution) prior is closer to the posterior target UAR than the proxy UAR.
The “closeness” of distributions is characterized by three measures from the general class of f-divergences, namely the total variation distance (TVD), the Kullback-Leibler divergence (\(D_{KL}\)), and the earth mover’s distance (EMD), also known as the Wasserstein distance.
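As a minimal sketch (on hypothetical histograms, not the study data), all three measures can be computed on a pair of discretized UAR distributions with scipy:

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

rng = np.random.default_rng(42)
bins = np.arange(16)                  # e.g. a 4-bit quantization of UAR
p = rng.dirichlet(np.ones(16))        # target UAR histogram (hypothetical)
q = rng.dirichlet(np.ones(16))        # proxy UAR histogram (hypothetical)

tvd = 0.5 * np.abs(p - q).sum()               # total variation distance
dkl = entropy(p, q, base=2)                   # D_KL(P||Q) in bits
emd = wasserstein_distance(bins, bins, p, q)  # earth mover's (Wasserstein) distance

print(f"TVD={tvd:.3f}  DKL={dkl:.3f} bits  EMD={emd:.3f}")
```

Note that TVD is bounded on \([0, 1]\) while \(D_{KL}\) and EMD are unbounded above, which matters when comparing them as discriminant quantities.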
For the Kullback-Leibler divergence \(D_{KL}\):
The (posterior) target UAR distribution is denoted by \(P\), and the proxy UAR distribution (model) is denoted by \(Q\).
A proxy model is informative for some target location if \(D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0\)
The binary problem formulation is then:
The discriminant function maps the difference in the two divergences to a binary outcome corresponding to the sign of the resulting quantity \(Y = +1 \text{ if } D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0 \text{ or } -1 \text{ otherwise}\)
The goal is to minimize the probability of incorrect predictions, defined by the (Bayes) error \(R_{\textbf{Bayes}}(\gamma, C) := \mathbb{P} \left(Y \neq \text{sign}(D_{KL}(\mathbf{P}||\mathbb{U}) - D_{KL}(\mathbf{P}||\mathbf{Q})) \right)\)
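A minimal sketch of this labeling rule on hypothetical quantized UAR distributions (the study's bins and data are not reproduced here):

```python
import numpy as np
from scipy.stats import entropy

def informativeness_label(p, q, eps=1e-12):
    """Return +1 if proxy Q is informative for target P relative to the
    uniform prior U, else -1: the sign of D_KL(P||U) - D_KL(P||Q)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    u = np.full_like(p, 1.0 / p.size)  # maximum-uncertainty (uniform) prior
    gap = entropy(p, u, base=2) - entropy(p, q, base=2)
    return 1 if gap > 0 else -1

p = np.array([0.6, 0.2, 0.1, 0.1])       # target UAR distribution
good_q = np.array([0.5, 0.3, 0.1, 0.1])  # proxy close to the target
bad_q = np.array([0.05, 0.05, 0.2, 0.7]) # proxy far from the target

print(informativeness_label(p, good_q))  # +1 (informative)
print(informativeness_label(p, bad_q))   # -1 (disinformative)
```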
The model used to generate streamflow predictions is likewise simple.
The equal unit area runoff (EUAR) model assumes that discharge at one location is equal to that at another on a unit area basis. The EUAR model is widely used since it is a reasonable first approximation where catchments are nested, are in close proximity, or are similar in terms of the processes that govern the rainfall-runoff response. Indices describing physical characteristics related to runoff process controls have long been used in PUB. Where past work has focused on predicting streamflow itself (citation), or on predicting hydrological signatures defined as scalar indices said to encapsulate processes, we take a different approach and predict aggregate system behaviour.
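The EUAR assumption reduces to scaling proxy discharge by the drainage-area ratio, as in this sketch with hypothetical numbers:

```python
# Equal unit area runoff: q/A is assumed identical at proxy and target.
proxy_q_m3s = 42.0       # observed discharge at the proxy gauge (hypothetical)
proxy_area_km2 = 1200.0
target_area_km2 = 300.0

target_q_m3s = proxy_q_m3s * (target_area_km2 / proxy_area_km2)
print(target_q_m3s)  # 10.5
```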
The EUAR model is translated to the binary classification problem by mapping the prediction of streamflow to a binary label \(\mathcal{Y}\): the model is profiled by its expected prediction compared to “random guessing”.
In the previous section, we mapped streamflow to entropy as a quantity describing the randomness of river systems. Since the (Shannon) entropy of the distribution does not embody one specific process, it does not fit with conventional classifications of hydrological signatures. Since it encompasses the entire distribution, the entropy represents some aggregate characteristic that does not attempt to disentangle the highly interrelated confounds of the hydrologic cycle.
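For illustration, the Shannon entropy of a quantized record can be computed on synthetic daily flows (hypothetical lognormal data, 4-bit quantization):

```python
import numpy as np

rng = np.random.default_rng(0)
flows = rng.lognormal(mean=0.0, sigma=1.0, size=3650)  # ~10 years of synthetic daily flow
counts, _ = np.histogram(flows, bins=2**4)             # 4-bit quantization (16 bins)
p = counts / counts.sum()
H = -np.sum(p[p > 0] * np.log2(p[p > 0]))              # Shannon entropy in bits

print(f"H = {H:.2f} bits (max {np.log2(16):.0f} bits)")
```

The entropy is bounded above by the bitrate of the quantization, reaching the bound only for a uniform histogram.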
This is accomplished by comparing unit area runoff (UAR) distributions in terms of their divergence from the target distribution: where the proxy distribution is farther from the target than the uniform prior, the model produces predictions that are worse than random guessing, as we will show using surrogate loss functions from the information theory literature.
This binary classification formulation brings the optimal (streamflow) sensor placement question into the domain of signal processing where much work has been done to prove statistical consistency using surrogate loss functions, substitutes for the non-convex 0-1 loss function.
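Schematically (not tied to the study data), the non-convex 0-1 loss and two common convex surrogates can be compared on the margin \(y\,\gamma(x)\):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)                # margin y * gamma(x)
zero_one = (margins <= 0).astype(float)            # non-convex 0-1 loss
hinge = np.maximum(0.0, 1.0 - margins)             # SVM (hinge) surrogate
logistic = np.log1p(np.exp(-margins)) / np.log(2)  # logistic surrogate, in bits

# the hinge loss upper-bounds the 0-1 loss everywhere,
# which is what makes surrogate risk minimization meaningful
print(np.all(hinge >= zero_one))  # True
```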
In the first steps, we looked at the predictability of a common hydrological signature and of an uncommon information measure, both computed on individual distributions. If the entropy measure is reasonably predictable from attributes, then the uncertainty of unmonitored locations might be predictable enough for some applications.
Here we compare runoff between large samples of location pairs and ask whether:
it can be predicted whether a monitored catchment is a better model for a target catchment than random guessing;
the Kullback-Leibler divergence (\(D_{KL}\)) of the distributions of unit area runoff at two locations can be predicted from the attributes of both catchments (and their differences).
We continue to use the gradient boosted decision tree method.
Experimental setup#
The decision of where to place sensors to optimize the monitoring of a random field relies on estimating a function \(\gamma\) that maps a d-dimensional covariate vector \(X\) to some label or value \(Y\) that must be consistently predictable in order to make decisions. Estimating the discriminant function \(\gamma\) that predicts \(Y\) is one part of the problem. The other part of the problem is to determine a suitable (convex) loss function such that empirical risk minimization procedures such as support vector machines and gradient boosting can be applied. [Nguyen et al., 2009] describes the relationship between surrogate loss functions and another class of functions called f-divergences.
In the case of the streamflow monitoring network, the goal is to maximize the information that the total monitoring network captures about the total unmonitored space. Stated differently, the aim is to determine the set of streamflow monitoring locations that yields the greatest reduction in uncertainty over the total space. The total space is defined as any location in a river network that can be monitored, and we represent this space by the set of confluences of a river network at a regional/continental scale. The baseline for estimating the reduction in uncertainty is maximum uncertainty, i.e. the uniform distribution. Examples of discriminant-loss function pairs \(\gamma \rightarrow \ell\) include:
\(\gamma\): predict streamflow at unmonitored locations using a parameterized rainfall runoff model with:
\(\ell\): SSE, MAE, MSE, NSE, KGE, etc.
\(\gamma\): predict the sign of \(D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q)\) at unmonitored locations with:
\(\ell\): the 0-1 loss, or a convex surrogate.
The discriminant function is the mapping of inputs to scalar-valued outputs such that a loss can be computed. In the optimal sensor network problem, each possible monitoring location is evaluated for its potential to reduce uncertainty at all other unmonitored locations. A (proxy) location is evaluated for its utility in reducing uncertainty at another (target) location by a model which takes in a d-dimensional covariate vector of catchment attributes \(\mathbb{X}\) describing both the proxy and target locations and returns a scalar value whose sign gives the predicted label.
Binary Classification#
[Nguyen et al., 2009] describes the general binary classification problem as finding a discriminant function to predict \(\mathcal{Y} = \{ -1, +1 \}\) from a covariate vector \(\mathcal{X}\) given a sample of observations \(\{(X_1, Y_1), \dots, (X_n, Y_n)\}\), \((X, Y) \in (\mathcal{X}, \mathcal{Y})\). The aim of the binary classification problem is to minimize the expectation of the 0-1 loss \(\mathbb{P}(Y \neq \text{sign}(\gamma (X)))\) where the notation \(\text{sign}(\alpha) = 1 \text{ if } \alpha > 0 \text{ else } -1\) follows [Nguyen et al., 2009].
The goal is to minimize the risk of bad predictions, and the problem is to simultaneously determine both the mapping \(\mathbf{C}\) of continuous streamflow \(\mathbf{S}\) to a discrete set of state symbols \(\mathbf{Z}\) and the discriminant function \(\gamma\).
The goal is to minimize the (Bayes) error \(R_{\textbf{Bayes}}(\gamma, C) := \mathbb{P} \left(Y \neq \text{sign}(D_{KL}(\mathbf{P}||\mathbb{U}) - D_{KL}(\mathbf{P}||\mathbf{Q})) \right)\)
import os
import pandas as pd
import numpy as np
from time import time
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7
import xgboost as xgb
xgb.set_config(verbosity=2)  # config_context is a context manager; set_config applies globally
from sklearn.metrics import (
root_mean_squared_error,
mean_absolute_error,
roc_auc_score,
accuracy_score,
)
import data_processing_functions as dpf
from scipy.stats import linregress
output_notebook()
BASE_DIR = os.getcwd()
Load attribute data#
# load the catchment characteristics
fname = 'BCUB_watershed_attributes_updated.csv'
attr_df = pd.read_csv(os.path.join('data', fname))
attr_df.columns = [c.lower() for c in attr_df.columns]
station_ids = attr_df['official_id'].values
print(f'There are {len(station_ids)} monitored basins in the attribute set.')
Load pairwise attribute comparisons#
Load a few rows from one of the pairwise data files. These files contain divergence measures computed on concurrent and non-concurrent time series at pairs of monitored locations.
# open an example pairwise results file
input_folder = os.path.join(
BASE_DIR, "data", "processed_divergence_inputs",
)
pairs_files = os.listdir(input_folder)
test_df = pd.read_csv(os.path.join(input_folder, pairs_files[0]), nrows=1000)
kld_columns = [c for c in test_df.columns if 'dkl' in c]
kld_columns
['dkl_concurrent_uniform',
'dkl_concurrent_post_-5R',
'dkl_concurrent_post_-4R',
'dkl_concurrent_post_-3R',
'dkl_concurrent_post_-2R',
'dkl_concurrent_post_-1R',
'dkl_concurrent_post_0R',
'dkl_concurrent_post_1R',
'dkl_concurrent_post_2R',
'dkl_concurrent_post_3R',
'dkl_concurrent_post_4R',
'dkl_concurrent_post_5R',
'dkl_concurrent_post_6R',
'dkl_concurrent_post_7R',
'dkl_concurrent_post_8R',
'dkl_concurrent_post_9R',
'dkl_concurrent_post_10R',
'dkl_nonconcurrent_uniform',
'dkl_nonconcurrent_post_-5R',
'dkl_nonconcurrent_post_-4R',
'dkl_nonconcurrent_post_-3R',
'dkl_nonconcurrent_post_-2R',
'dkl_nonconcurrent_post_-1R',
'dkl_nonconcurrent_post_0R',
'dkl_nonconcurrent_post_1R',
'dkl_nonconcurrent_post_2R',
'dkl_nonconcurrent_post_3R',
'dkl_nonconcurrent_post_4R',
'dkl_nonconcurrent_post_5R',
'dkl_nonconcurrent_post_6R',
'dkl_nonconcurrent_post_7R',
'dkl_nonconcurrent_post_8R',
'dkl_nonconcurrent_post_9R',
'dkl_nonconcurrent_post_10R']
Define attribute groupings#
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg'] #'gravelius', 'perimeter',
land_cover = [
'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010',
'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
len(all_attributes)
24
Set trial parameters#
# define the amount of data to set aside for final testing
holdout_pct = 0.10
nfolds = 5
n_boost_rounds = 2500
n_optimization_rounds = 20
#define if testing concurrent or nonconcurrent data
concurrent = 'concurrent'
# the input data file has an associated revision date
revision_date = '20240812'
all_test_results = {}
attribute_set_names = ['climate', '+land_cover', '+terrain', '+soil']
results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results')
if not os.path.exists(results_folder):
os.makedirs(results_folder)
attr_df.columns
Index(['region', 'official_id', 'drainage_area_km2', 'centroid_lon_deg_e',
'centroid_lat_deg_n', 'logk_ice_x100', 'porosity_x100',
'land_use_forest_frac_2010', 'land_use_shrubs_frac_2010',
'land_use_grass_frac_2010', 'land_use_wetland_frac_2010',
'land_use_crops_frac_2010', 'land_use_urban_frac_2010',
'land_use_water_frac_2010', 'land_use_snow_ice_frac_2010',
'lulc_check_2010', 'land_use_forest_frac_2015',
'land_use_shrubs_frac_2015', 'land_use_grass_frac_2015',
'land_use_wetland_frac_2015', 'land_use_crops_frac_2015',
'land_use_urban_frac_2015', 'land_use_water_frac_2015',
'land_use_snow_ice_frac_2015', 'lulc_check_2015',
'land_use_forest_frac_2020', 'land_use_shrubs_frac_2020',
'land_use_grass_frac_2020', 'land_use_wetland_frac_2020',
'land_use_crops_frac_2020', 'land_use_urban_frac_2020',
'land_use_water_frac_2020', 'land_use_snow_ice_frac_2020',
'lulc_check_2020', 'slope_deg', 'aspect_deg', 'median_el', 'mean_el',
'max_el', 'min_el', 'elevation_m', 'prcp', 'tmin', 'tmax', 'vp', 'swe',
'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration',
'high_prcp_freq', 'mean_runoff'],
dtype='object')
def add_attributes(attr_df, df_relations, attribute_cols):
"""
Adds attributes from the df_attributes to the df_relations based on the 'proxy' and 'target' columns
using map for efficient lookups.
Parameters:
attr_df (pd.DataFrame): DataFrame with 'official_id' and attribute columns.
df_relations (pd.DataFrame): DataFrame with 'proxy' and 'target' columns.
attribute_cols (list of str): List of attribute columns to add to df_relations.
Returns:
pd.DataFrame: Updated df_relations with added attribute columns.
"""
# Create dictionaries for each attribute for quick lookup
attr_dicts = {col: attr_df.set_index('official_id')[col].to_dict() for col in attribute_cols}
# Add target attributes
for col in attribute_cols:
df_relations[f'target_{col}'] = df_relations['target'].map(attr_dicts[col])
# Add proxy attributes
for col in attribute_cols:
df_relations[f'proxy_{col}'] = df_relations['proxy'].map(attr_dicts[col])
return df_relations
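The map-based lookup that `add_attributes` performs can be seen on a toy example (hypothetical IDs and values):

```python
import pandas as pd

# toy attribute table and proxy/target pair table (hypothetical values)
attrs = pd.DataFrame({
    'official_id': ['A', 'B', 'C'],
    'prcp': [900.0, 650.0, 1200.0],
})
pairs = pd.DataFrame({'proxy': ['A', 'B'], 'target': ['C', 'A']})

# the same dictionary lookup used inside add_attributes
lut = attrs.set_index('official_id')['prcp'].to_dict()
pairs['proxy_prcp'] = pairs['proxy'].map(lut)
pairs['target_prcp'] = pairs['target'].map(lut)
print(pairs)
```

Building the dictionary once and using `Series.map` avoids a row-wise merge for each attribute column, which matters at the scale of millions of pairs.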
def predict_KLD_from_attributes(attr_df, target_col, holdout_pct, stations, nfolds, results_folder, partial_counts=False):
training_stn_cv_sets, test_stn_sets = dpf.train_test_split_by_official_id(holdout_pct, stations, nfolds)
all_test_results = {}
for bitrate in [4, 6, 8, 9, 10, 11, 12]:
t0 = time()
print(f'bitrate = {bitrate}')
fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
if partial_counts:
fname = f"KL_results_{bitrate}bits_{revision_date}_partial_counts.csv"
input_data_fpath = os.path.join(input_folder, fname)
nrows = None
df = pd.read_csv(input_data_fpath, nrows=nrows, low_memory=False)
df.dropna(subset=[target_col], inplace=True)
t1 = time()
print(f' {t1-t0:.2f}s to load input data')
# add the attributes into the input dataset
df = add_attributes(attr_df, df, all_attributes)
all_test_results[bitrate] = {}
input_attributes = []
# add attribute groups successively
for attribute_set, set_name in zip([climate, land_cover, terrain, soil], attribute_set_names):
print(f' Processing {set_name} attribute set: {target_col}')
input_attributes += attribute_set
features = dpf.format_features(input_attributes)
trial_df, test_df = dpf.run_xgb_trials_custom_CV(
bitrate, set_name, features, target_col, df,
training_stn_cv_sets, test_stn_sets, n_optimization_rounds,
nfolds, n_boost_rounds, results_folder
)
test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])
print(f' held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
print('')
# store the test set predictions and actuals
all_test_results[bitrate][set_name] = {
'trials': trial_df, 'test_df': test_df,
'test_mae': test_mae, 'test_rmse': test_rmse}
return all_test_results
Run Models#
priors_to_test = [-2, -1, 0, 1, 2]
for prior in priors_to_test:
target_col = f'dkl_{concurrent}_post_{prior}R'
test_results_fname = f'{target_col}_{prior}_prior_results.npy'
test_results_fpath = os.path.join(results_folder, test_results_fname)
if os.path.exists(test_results_fpath):
all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
else:
all_test_results = predict_KLD_from_attributes(attr_df, target_col, holdout_pct, station_ids, nfolds, results_folder)
np.save(test_results_fpath, all_test_results)
bitrate = 4
12.02s to load input data
Processing climate attribute set.
3.08 ± 0.029 RMSE mean on the test set (N=20)
held-out test rmse: 2.89, mae: 2.00
Processing +land_cover attribute set.
2.34 ± 0.101 RMSE mean on the test set (N=20)
held-out test rmse: 2.16, mae: 1.44
Processing +terrain attribute set.
2.33 ± 0.107 RMSE mean on the test set (N=20)
held-out test rmse: 2.12, mae: 1.40
Processing +soil attribute set.
2.11 ± 0.082 RMSE mean on the test set (N=20)
held-out test rmse: 1.86, mae: 1.21
bitrate = 6
12.24s to load input data
Processing climate attribute set.
3.09 ± 0.034 RMSE mean on the test set (N=20)
held-out test rmse: 2.86, mae: 2.03
Processing +land_cover attribute set.
2.30 ± 0.081 RMSE mean on the test set (N=20)
held-out test rmse: 2.09, mae: 1.40
Processing +terrain attribute set.
2.25 ± 0.092 RMSE mean on the test set (N=20)
held-out test rmse: 2.03, mae: 1.35
Processing +soil attribute set.
2.05 ± 0.065 RMSE mean on the test set (N=20)
held-out test rmse: 1.84, mae: 1.23
bitrate = 8
12.19s to load input data
Processing climate attribute set.
3.21 ± 0.060 RMSE mean on the test set (N=20)
held-out test rmse: 2.95, mae: 2.21
Processing +land_cover attribute set.
2.24 ± 0.065 RMSE mean on the test set (N=20)
held-out test rmse: 1.99, mae: 1.34
Processing +terrain attribute set.
2.23 ± 0.067 RMSE mean on the test set (N=20)
held-out test rmse: 1.92, mae: 1.30
Processing +soil attribute set.
2.02 ± 0.048 RMSE mean on the test set (N=20)
held-out test rmse: 1.76, mae: 1.19
bitrate = 9
12.29s to load input data
Processing climate attribute set.
3.32 ± 0.076 RMSE mean on the test set (N=20)
held-out test rmse: 3.09, mae: 2.40
Processing +land_cover attribute set.
2.24 ± 0.054 RMSE mean on the test set (N=20)
held-out test rmse: 2.02, mae: 1.40
Processing +terrain attribute set.
2.23 ± 0.057 RMSE mean on the test set (N=20)
held-out test rmse: 1.95, mae: 1.35
Processing +soil attribute set.
2.04 ± 0.043 RMSE mean on the test set (N=20)
held-out test rmse: 1.77, mae: 1.23
bitrate = 10
12.09s to load input data
Processing climate attribute set.
3.46 ± 0.086 RMSE mean on the test set (N=20)
held-out test rmse: 3.22, mae: 2.56
Processing +land_cover attribute set.
2.26 ± 0.050 RMSE mean on the test set (N=20)
held-out test rmse: 1.96, mae: 1.40
Processing +terrain attribute set.
2.24 ± 0.057 RMSE mean on the test set (N=20)
held-out test rmse: 1.93, mae: 1.38
Processing +soil attribute set.
2.06 ± 0.045 RMSE mean on the test set (N=20)
held-out test rmse: 1.77, mae: 1.25
bitrate = 11
12.36s to load input data
Processing climate attribute set.
3.60 ± 0.082 RMSE mean on the test set (N=20)
held-out test rmse: 3.38, mae: 2.74
Processing +land_cover attribute set.
2.26 ± 0.055 RMSE mean on the test set (N=20)
held-out test rmse: 2.05, mae: 1.49
Processing +terrain attribute set.
2.25 ± 0.058 RMSE mean on the test set (N=20)
held-out test rmse: 1.96, mae: 1.43
Processing +soil attribute set.
2.10 ± 0.045 RMSE mean on the test set (N=20)
held-out test rmse: 1.86, mae: 1.35
bitrate = 12
12.49s to load input data
Processing climate attribute set.
3.76 ± 0.074 RMSE mean on the test set (N=20)
held-out test rmse: 3.57, mae: 2.96
Processing +land_cover attribute set.
2.31 ± 0.055 RMSE mean on the test set (N=20)
held-out test rmse: 2.01, mae: 1.47
Processing +terrain attribute set.
2.29 ± 0.051 RMSE mean on the test set (N=20)
held-out test rmse: 2.02, mae: 1.48
Processing +soil attribute set.
2.14 ± 0.042 RMSE mean on the test set (N=20)
held-out test rmse: 1.86, mae: 1.35
bitrate = 4
12.49s to load input data
Processing climate attribute set.
2.48 ± 0.153 RMSE mean on the test set (N=20)
held-out test rmse: 2.49, mae: 1.73
Processing +land_cover attribute set.
1.91 ± 0.165 RMSE mean on the test set (N=20)
held-out test rmse: 1.86, mae: 1.26
Processing +terrain attribute set.
1.89 ± 0.160 RMSE mean on the test set (N=20)
held-out test rmse: 1.85, mae: 1.24
Processing +soil attribute set.
1.73 ± 0.148 RMSE mean on the test set (N=20)
held-out test rmse: 1.68, mae: 1.09
bitrate = 6
12.28s to load input data
Processing climate attribute set.
2.43 ± 0.130 RMSE mean on the test set (N=20)
held-out test rmse: 2.46, mae: 1.74
Processing +land_cover attribute set.
1.81 ± 0.147 RMSE mean on the test set (N=20)
held-out test rmse: 1.77, mae: 1.22
Processing +terrain attribute set.
1.82 ± 0.148 RMSE mean on the test set (N=20)
held-out test rmse: 1.79, mae: 1.23
Processing +soil attribute set.
1.64 ± 0.139 RMSE mean on the test set (N=20)
held-out test rmse: 1.72, mae: 1.14
bitrate = 8
12.33s to load input data
Processing climate attribute set.
2.46 ± 0.119 RMSE mean on the test set (N=20)
held-out test rmse: 2.46, mae: 1.81
Processing +land_cover attribute set.
1.75 ± 0.130 RMSE mean on the test set (N=20)
held-out test rmse: 1.76, mae: 1.22
Processing +terrain attribute set.
1.74 ± 0.137 RMSE mean on the test set (N=20)
held-out test rmse: 1.69, mae: 1.17
Processing +soil attribute set.
1.59 ± 0.124 RMSE mean on the test set (N=20)
held-out test rmse: 1.61, mae: 1.08
bitrate = 9
12.13s to load input data
Processing climate attribute set.
2.51 ± 0.116 RMSE mean on the test set (N=20)
held-out test rmse: 2.51, mae: 1.89
Processing +land_cover attribute set.
1.73 ± 0.122 RMSE mean on the test set (N=20)
held-out test rmse: 1.72, mae: 1.21
Processing +terrain attribute set.
1.73 ± 0.127 RMSE mean on the test set (N=20)
held-out test rmse: 1.69, mae: 1.17
Processing +soil attribute set.
1.58 ± 0.119 RMSE mean on the test set (N=20)
held-out test rmse: 1.57, mae: 1.08
bitrate = 10
12.30s to load input data
Processing climate attribute set.
2.57 ± 0.112 RMSE mean on the test set (N=20)
held-out test rmse: 2.60, mae: 2.00
Processing +land_cover attribute set.
1.72 ± 0.120 RMSE mean on the test set (N=20)
held-out test rmse: 1.72, mae: 1.22
Processing +terrain attribute set.
1.72 ± 0.123 RMSE mean on the test set (N=20)
held-out test rmse: 1.69, mae: 1.19
Processing +soil attribute set.
1.58 ± 0.119 RMSE mean on the test set (N=20)
held-out test rmse: 1.59, mae: 1.10
bitrate = 11
12.43s to load input data
Processing climate attribute set.
2.64 ± 0.107 RMSE mean on the test set (N=20)
held-out test rmse: 2.68, mae: 2.10
Processing +land_cover attribute set.
1.71 ± 0.114 RMSE mean on the test set (N=20)
held-out test rmse: 1.71, mae: 1.21
Processing +terrain attribute set.
1.71 ± 0.129 RMSE mean on the test set (N=20)
held-out test rmse: 1.70, mae: 1.19
Processing +soil attribute set.
1.59 ± 0.109 RMSE mean on the test set (N=20)
held-out test rmse: 1.62, mae: 1.13
bitrate = 12
12.34s to load input data
Processing climate attribute set.
2.73 ± 0.103 RMSE mean on the test set (N=20)
held-out test rmse: 2.76, mae: 2.20
Processing +land_cover attribute set.
1.71 ± 0.118 RMSE mean on the test set (N=20)
held-out test rmse: 1.77, mae: 1.26
Processing +terrain attribute set.
1.73 ± 0.118 RMSE mean on the test set (N=20)
held-out test rmse: 1.70, mae: 1.19
Processing +soil attribute set.
1.61 ± 0.107 RMSE mean on the test set (N=20)
held-out test rmse: 1.59, mae: 1.11
bitrate = 4
12.96s to load input data
Processing climate attribute set.
1.89 ± 0.078 RMSE mean on the test set (N=20)
held-out test rmse: 1.97, mae: 1.44
Processing +land_cover attribute set.
1.41 ± 0.102 RMSE mean on the test set (N=20)
held-out test rmse: 1.32, mae: 0.93
Processing +terrain attribute set.
1.41 ± 0.101 RMSE mean on the test set (N=20)
held-out test rmse: 1.35, mae: 0.94
Processing +soil attribute set.
1.26 ± 0.083 RMSE mean on the test set (N=20)
held-out test rmse: 1.19, mae: 0.82
bitrate = 6
12.83s to load input data
Processing climate attribute set.
1.81 ± 0.062 RMSE mean on the test set (N=20)
held-out test rmse: 1.88, mae: 1.38
Processing +land_cover attribute set.
1.32 ± 0.088 RMSE mean on the test set (N=20)
held-out test rmse: 1.25, mae: 0.88
Processing +terrain attribute set.
1.32 ± 0.085 RMSE mean on the test set (N=20)
held-out test rmse: 1.27, mae: 0.88
Processing +soil attribute set.
1.19 ± 0.071 RMSE mean on the test set (N=20)
held-out test rmse: 1.09, mae: 0.75
bitrate = 8
12.75s to load input data
Processing climate attribute set.
1.77 ± 0.044 RMSE mean on the test set (N=20)
held-out test rmse: 1.82, mae: 1.34
Processing +land_cover attribute set.
1.27 ± 0.069 RMSE mean on the test set (N=20)
held-out test rmse: 1.17, mae: 0.81
Processing +terrain attribute set.
1.26 ± 0.063 RMSE mean on the test set (N=20)
held-out test rmse: 1.19, mae: 0.83
Processing +soil attribute set.
1.16 ± 0.053 RMSE mean on the test set (N=20)
held-out test rmse: 1.06, mae: 0.73
bitrate = 9
12.48s to load input data
Processing climate attribute set.
1.78 ± 0.041 RMSE mean on the test set (N=20)
held-out test rmse: 1.83, mae: 1.34
Processing +land_cover attribute set.
1.28 ± 0.064 RMSE mean on the test set (N=20)
held-out test rmse: 1.17, mae: 0.82
Processing +terrain attribute set.
1.25 ± 0.059 RMSE mean on the test set (N=20)
held-out test rmse: 1.17, mae: 0.81
Processing +soil attribute set.
1.15 ± 0.051 RMSE mean on the test set (N=20)
held-out test rmse: 1.03, mae: 0.73
bitrate = 10
12.52s to load input data
Processing climate attribute set.
1.81 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 1.85, mae: 1.36
Processing +land_cover attribute set.
1.27 ± 0.067 RMSE mean on the test set (N=20)
held-out test rmse: 1.18, mae: 0.85
Processing +terrain attribute set.
1.26 ± 0.063 RMSE mean on the test set (N=20)
held-out test rmse: 1.21, mae: 0.86
Processing +soil attribute set.
1.17 ± 0.043 RMSE mean on the test set (N=20)
held-out test rmse: 1.08, mae: 0.77
bitrate = 11
12.50s to load input data
Processing climate attribute set.
1.86 ± 0.039 RMSE mean on the test set (N=20)
held-out test rmse: 1.90, mae: 1.42
Processing +land_cover attribute set.
1.31 ± 0.060 RMSE mean on the test set (N=20)
held-out test rmse: 1.20, mae: 0.89
Processing +terrain attribute set.
1.29 ± 0.062 RMSE mean on the test set (N=20)
held-out test rmse: 1.22, mae: 0.90
Processing +soil attribute set.
1.21 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 1.11, mae: 0.82
bitrate = 12
12.40s to load input data
Processing climate attribute set.
1.91 ± 0.042 RMSE mean on the test set (N=20)
held-out test rmse: 1.95, mae: 1.46
Processing +land_cover attribute set.
1.35 ± 0.053 RMSE mean on the test set (N=20)
held-out test rmse: 1.31, mae: 0.98
Processing +terrain attribute set.
1.34 ± 0.054 RMSE mean on the test set (N=20)
held-out test rmse: 1.26, mae: 0.94
Processing +soil attribute set.
1.27 ± 0.042 RMSE mean on the test set (N=20)
held-out test rmse: 1.15, mae: 0.87
bitrate = 4
12.18s to load input data
Processing climate attribute set.
1.37 ± 0.059 RMSE mean on the test set (N=20)
held-out test rmse: 1.26, mae: 0.93
Processing +land_cover attribute set.
1.03 ± 0.046 RMSE mean on the test set (N=20)
held-out test rmse: 0.93, mae: 0.67
Processing +terrain attribute set.
1.03 ± 0.045 RMSE mean on the test set (N=20)
held-out test rmse: 0.93, mae: 0.66
Processing +soil attribute set.
0.94 ± 0.044 RMSE mean on the test set (N=20)
held-out test rmse: 0.82, mae: 0.58
bitrate = 6
12.20s to load input data
Processing climate attribute set.
1.28 ± 0.038 RMSE mean on the test set (N=20)
held-out test rmse: 1.19, mae: 0.85
Processing +land_cover attribute set.
0.99 ± 0.036 RMSE mean on the test set (N=20)
held-out test rmse: 0.90, mae: 0.65
Processing +terrain attribute set.
0.98 ± 0.034 RMSE mean on the test set (N=20)
held-out test rmse: 0.93, mae: 0.67
Processing +soil attribute set.
0.90 ± 0.034 RMSE mean on the test set (N=20)
held-out test rmse: 0.81, mae: 0.59
bitrate = 8
12.17s to load input data
Processing climate attribute set.
1.36 ± 0.036 RMSE mean on the test set (N=20)
held-out test rmse: 1.27, mae: 0.93
Processing +land_cover attribute set.
1.10 ± 0.036 RMSE mean on the test set (N=20)
held-out test rmse: 1.02, mae: 0.76
Processing +terrain attribute set.
1.10 ± 0.031 RMSE mean on the test set (N=20)
held-out test rmse: 1.02, mae: 0.77
Processing +soil attribute set.
1.01 ± 0.035 RMSE mean on the test set (N=20)
held-out test rmse: 0.91, mae: 0.70
bitrate = 9
12.15s to load input data
Processing climate attribute set.
1.48 ± 0.038 RMSE mean on the test set (N=20)
held-out test rmse: 1.39, mae: 1.04
Processing +land_cover attribute set.
1.22 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 1.12, mae: 0.86
Processing +terrain attribute set.
1.22 ± 0.036 RMSE mean on the test set (N=20)
held-out test rmse: 1.12, mae: 0.86
Processing +soil attribute set.
1.14 ± 0.038 RMSE mean on the test set (N=20)
held-out test rmse: 1.03, mae: 0.81
bitrate = 10
11.97s to load input data
Processing climate attribute set.
1.62 ± 0.039 RMSE mean on the test set (N=20)
held-out test rmse: 1.52, mae: 1.18
Processing +land_cover attribute set.
1.36 ± 0.042 RMSE mean on the test set (N=20)
held-out test rmse: 1.24, mae: 0.96
Processing +terrain attribute set.
1.35 ± 0.041 RMSE mean on the test set (N=20)
held-out test rmse: 1.25, mae: 0.98
Processing +soil attribute set.
1.26 ± 0.038 RMSE mean on the test set (N=20)
held-out test rmse: 1.14, mae: 0.90
bitrate = 11
12.41s to load input data
Processing climate attribute set.
1.72 ± 0.042 RMSE mean on the test set (N=20)
held-out test rmse: 1.63, mae: 1.27
Processing +land_cover attribute set.
1.45 ± 0.046 RMSE mean on the test set (N=20)
held-out test rmse: 1.33, mae: 1.04
Processing +terrain attribute set.
1.45 ± 0.045 RMSE mean on the test set (N=20)
held-out test rmse: 1.35, mae: 1.06
Processing +soil attribute set.
1.36 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 1.23, mae: 0.98
bitrate = 12
12.40s to load input data
Processing climate attribute set.
1.79 ± 0.044 RMSE mean on the test set (N=20)
held-out test rmse: 1.69, mae: 1.33
Processing +land_cover attribute set.
1.51 ± 0.048 RMSE mean on the test set (N=20)
held-out test rmse: 1.38, mae: 1.09
Processing +terrain attribute set.
1.51 ± 0.049 RMSE mean on the test set (N=20)
held-out test rmse: 1.39, mae: 1.09
Processing +soil attribute set.
1.43 ± 0.044 RMSE mean on the test set (N=20)
held-out test rmse: 1.29, mae: 1.02
bitrate = 4
12.59s to load input data
Processing climate attribute set.
0.93 ± 0.041 RMSE mean on the test set (N=20)
held-out test rmse: 0.81, mae: 0.64
Processing +land_cover attribute set.
0.78 ± 0.028 RMSE mean on the test set (N=20)
held-out test rmse: 0.71, mae: 0.55
Processing +terrain attribute set.
0.77 ± 0.030 RMSE mean on the test set (N=20)
held-out test rmse: 0.70, mae: 0.55
Processing +soil attribute set.
0.71 ± 0.025 RMSE mean on the test set (N=20)
held-out test rmse: 0.67, mae: 0.52
bitrate = 6
12.18s to load input data
Processing climate attribute set.
1.17 ± 0.033 RMSE mean on the test set (N=20)
held-out test rmse: 1.10, mae: 0.90
Processing +land_cover attribute set.
1.03 ± 0.012 RMSE mean on the test set (N=20)
held-out test rmse: 1.01, mae: 0.80
Processing +terrain attribute set.
1.02 ± 0.015 RMSE mean on the test set (N=20)
held-out test rmse: 1.01, mae: 0.79
Processing +soil attribute set.
0.96 ± 0.025 RMSE mean on the test set (N=20)
held-out test rmse: 0.96, mae: 0.74
bitrate = 8
12.23s to load input data
Processing climate attribute set.
1.63 ± 0.034 RMSE mean on the test set (N=20)
held-out test rmse: 1.60, mae: 1.35
Processing +land_cover attribute set.
1.36 ± 0.017 RMSE mean on the test set (N=20)
held-out test rmse: 1.40, mae: 1.13
Processing +terrain attribute set.
1.35 ± 0.019 RMSE mean on the test set (N=20)
held-out test rmse: 1.38, mae: 1.10
Processing +soil attribute set.
1.28 ± 0.021 RMSE mean on the test set (N=20)
held-out test rmse: 1.34, mae: 1.07
bitrate = 9
12.19s to load input data
Processing climate attribute set.
1.86 ± 0.037 RMSE mean on the test set (N=20)
held-out test rmse: 1.85, mae: 1.56
Processing +land_cover attribute set.
1.52 ± 0.019 RMSE mean on the test set (N=20)
held-out test rmse: 1.57, mae: 1.27
Processing +terrain attribute set.
1.50 ± 0.018 RMSE mean on the test set (N=20)
held-out test rmse: 1.55, mae: 1.25
Processing +soil attribute set.
1.42 ± 0.020 RMSE mean on the test set (N=20)
held-out test rmse: 1.49, mae: 1.19
bitrate = 10
12.08s to load input data
Processing climate attribute set.
2.05 ± 0.039 RMSE mean on the test set (N=20)
held-out test rmse: 2.05, mae: 1.73
Processing +land_cover attribute set.
1.65 ± 0.017 RMSE mean on the test set (N=20)
held-out test rmse: 1.72, mae: 1.40
Processing +terrain attribute set.
1.63 ± 0.016 RMSE mean on the test set (N=20)
held-out test rmse: 1.69, mae: 1.36
Processing +soil attribute set.
1.54 ± 0.017 RMSE mean on the test set (N=20)
held-out test rmse: 1.64, mae: 1.31
bitrate = 11
12.48s to load input data
Processing climate attribute set.
2.17 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 2.17, mae: 1.83
Processing +land_cover attribute set.
1.74 ± 0.016 RMSE mean on the test set (N=20)
held-out test rmse: 1.83, mae: 1.49
Processing +terrain attribute set.
1.72 ± 0.015 RMSE mean on the test set (N=20)
held-out test rmse: 1.79, mae: 1.45
Processing +soil attribute set.
1.63 ± 0.015 RMSE mean on the test set (N=20)
held-out test rmse: 1.72, mae: 1.38
bitrate = 12
12.40s to load input data
Processing climate attribute set.
2.23 ± 0.040 RMSE mean on the test set (N=20)
held-out test rmse: 2.24, mae: 1.88
Processing +land_cover attribute set.
1.79 ± 0.021 RMSE mean on the test set (N=20)
held-out test rmse: 1.86, mae: 1.50
Processing +terrain attribute set.
1.77 ± 0.015 RMSE mean on the test set (N=20)
held-out test rmse: 1.85, mae: 1.49
Processing +soil attribute set.
1.67 ± 0.012 RMSE mean on the test set (N=20)
held-out test rmse: 1.78, mae: 1.42
def load_result_by_prior(prior):
    fname = f'dkl_{concurrent}_post_{prior}R_{prior}_prior_results.npy'
    fpath = os.path.join(results_folder, fname)
    return np.load(fpath, allow_pickle=True).item()

layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}
for prior in priors_to_test:
    plots = []
    result = load_result_by_prior(prior)
    reg_plots_dict[prior] = {}
    res_r2_dict[prior] = {}
    for b, set_dict in result.items():
        attribute_sets = list(set_dict.keys())
        y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
        y2 = [set_dict[e]['test_mae'] for e in attribute_sets]
        source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})
        title = f'{b} bits (Q(θ|D)∼Dirichlet(α10^{prior}_K))'
        # share the first panel's y-range so bitrates are directly comparable
        if len(plots) == 0:
            fig = figure(title=title, x_range=attribute_sets)
        else:
            fig = figure(title=title, x_range=attribute_sets, y_range=plots[0].y_range)
        fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
        fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        fig.yaxis.axis_label = 'RMSE'
        fig.xaxis.axis_label = 'Attribute Group (additive)'
        result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
        best_rmse_idx = result_df['rmse'].idxmin()
        best_mae_idx = result_df['mae'].idxmin()
        best_rmse_set = result_df.loc[best_rmse_idx, 'set']
        best_mae_set = result_df.loc[best_mae_idx, 'set']
        best_result = set_dict[best_rmse_set]['test_df']
        xx, yy = best_result['actual'], best_result['predicted']
        slope, intercept, r, p, se = linregress(xx, yy)
        sfig = figure(title=f'Test: {b} bits best model {best_rmse_set} (N={len(best_result)})')
        sfig.scatter(xx, yy, size=1, alpha=0.6)
        xpred = np.linspace(min(xx), max(xx), 100)
        ybf = [slope * e + intercept for e in xpred]
        sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')
        # plot a 1:1 line
        sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted',
                  line_width=2, legend_label='1:1')
        sfig.xaxis.axis_label = r'Actual $$D_{KL}$$ [bits/sample]'
        sfig.yaxis.axis_label = r'Predicted $$D_{KL}$$ [bits/sample]'
        sfig.legend.location = 'top_left'
        reg_plots_dict[prior][b] = sfig
        res_r2_dict[prior][b] = r**2
        plots.append(fig)
        plots.append(sfig)
    layout_dict[prior] = gridplot(plots, ncols=2, width=350, height=300)
show(layout_dict[-2])
show(layout_dict[-1])
show(layout_dict[0])
show(layout_dict[1])
show(layout_dict[2])
col_figs = []
for prior in [-2, -1, 0, 1, 2]:
    col = column([reg_plots_dict[prior][b] for b in [4, 6, 8, 9, 10, 11, 12]])
    col_figs.append(col)
reg_layout = row(col_figs)
show(reg_layout)
from bokeh.transform import linear_cmap
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.palettes import Viridis256, gray, magma
# Convert the nested dict into a DataFrame
df = pd.DataFrame(res_r2_dict).T  # transpose so priors form the index (rows) and bitrates the columns
df.index.name = 'Prior'
df.columns.name = 'Bitrate'
df
| Prior \ Bitrate | 4 | 6 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|
| -2 | 0.667416 | 0.672688 | 0.715624 | 0.738329 | 0.753776 | 0.753705 | 0.774256 |
| -1 | 0.654098 | 0.634524 | 0.682851 | 0.711636 | 0.717136 | 0.714581 | 0.737956 |
| 0 | 0.719002 | 0.746315 | 0.754369 | 0.770585 | 0.755401 | 0.747406 | 0.740944 |
| 1 | 0.671507 | 0.626959 | 0.569079 | 0.520582 | 0.498875 | 0.488139 | 0.475856 |
| 2 | 0.479448 | 0.342760 | 0.379661 | 0.435185 | 0.447244 | 0.463187 | 0.458513 |
# Melt the DataFrame to a long format
df_melted = df.reset_index().melt(id_vars='Prior', var_name='Bitrate', value_name='Value')
# Ensure the Bitrate values are ordered correctly (increasing order)
df_melted['Bitrate'] = pd.Categorical(df_melted['Bitrate'], categories=sorted(df_melted['Bitrate'].unique(), reverse=False), ordered=True)
# Create a Bokeh ColumnDataSource
source = ColumnDataSource(df_melted)
# Create a figure for the heatmap
p = figure(title="KL divergence from attributes: R² of test set by Prior and Bitrate",
           width=600, height=500, tools="hover",
           tooltips=[('Value', '@Value')], toolbar_location=None)
# Create a color mapper
mapper = linear_cmap(field_name='Value', palette=magma(256), low=df_melted.Value.min(), high=df_melted.Value.max())
# Add rectangles to the plot
p.rect(x="Prior", y="Bitrate", width=1, height=1, source=source,
       line_color=None, fill_color=mapper)
# Add color bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8, location=(0,0))
p.add_layout(color_bar, 'right')
# Format plot
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.xaxis.axis_label = r'$$Q(\theta|D)\sim\text{Dirichlet}(\alpha = 10^{a})$$'
p.yaxis.axis_label = r'$$\text{Quantization Bitrate (dictionary size)}$$'
p.axis.major_label_text_font_size = "10pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0
# Output the plot to an HTML file and display it
# output_file("heatmap.html")
show(p)
Discussion#
Since the KL divergence \(D_{KL}(P \| Q) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right)\) diverges to \(+\infty\) whenever \(q_i = 0\) for a state with \(p_i > 0\), the simulated \(Q\) is treated as a posterior distribution by assuming a uniform (Dirichlet) prior \(\alpha = [a_1, \dots, a_K]\). The prior \(\alpha\) is an additive array of uniform pseudo-counts that addresses the commonly occurring case where \(q_i = 0\), that is, where the model does not predict an observed state \(i\). In this experiment we tested a wide range of priors on an exponential scale, \(\alpha = 10^a\) with \(a \in \{-2, -1, 0, 1, 2\}\).
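To make the smoothing concrete, the following is a minimal sketch (not the notebook's actual implementation) of computing \(D_{KL}\) in bits with a uniform pseudo-count \(\alpha\) added to the model counts; the function name and the four-state example counts are illustrative assumptions:

```python
import numpy as np

def kl_bits(p_counts, q_counts, alpha):
    """D_KL(P || Q) in bits, with a uniform pseudo-count alpha added to the
    model counts so that every q_i > 0 (additive Dirichlet smoothing)."""
    p = np.asarray(p_counts, dtype=float)
    p /= p.sum()
    q = np.asarray(q_counts, dtype=float) + alpha  # additive pseudo-counts
    q /= q.sum()
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Observed state counts P and model counts Q; the model never predicts state 2.
p_counts = [5, 3, 2, 0]
q_counts = [8, 2, 0, 0]
strong_belief = kl_bits(p_counts, q_counts, alpha=0.01)  # harsh penalty for the missed state
weak_belief = kl_bits(p_counts, q_counts, alpha=100.0)   # Q pulled toward uniform
```

Without the pseudo-count the term for the missed state would be infinite; a small \(\alpha\) keeps it finite but large, while a large \(\alpha\) flattens \(Q\) toward uniform and shrinks the penalty.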
The scale of the pseudo-count can be interpreted as a strength of belief in the model. A small \(a\) represents strong belief that the model produces a distribution \(Q\) representative of the “true” (observed posterior) distribution \(P\): for a fixed \(p_i\), decreasing \(a\) makes the discriminant function \(D_{KL}\) penalize a model that fails to predict an observed state more heavily. A large \(a\) represents weak belief that \(Q\) is representative of \(P\), since \(Q\) approaches the uniform distribution \(\mathbb{U}\) as \(a\) increases.
Adding pseudo-counts dilutes the signal available for the gradient boosting model to exploit in minimizing prediction error. Analogously, varying the bitrate, i.e. the size of the dictionary used to quantize continuous streamflow into discrete states, adds quantization noise: the original streamflow records are stored at three decimal places of precision and are quantized into as few as 4 bits (a 16-symbol dictionary) and as many as 12 bits (a 4096-symbol dictionary). The range of dictionary sizes is set to cover the expected range of rating curve uncertainty, which is generally considered multiplicative and expressed as a percentage of the observed value.
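The quantization step can be sketched as follows; the equal-probability (quantile) binning shown here is an assumption for illustration, and the dictionary construction used in the experiment may differ:

```python
import numpy as np

def quantize(series, bits):
    """Map a continuous series onto a dictionary of 2**bits discrete symbols
    using equal-probability (quantile) bin edges."""
    n_symbols = 2 ** bits
    edges = np.quantile(series, np.linspace(0.0, 1.0, n_symbols + 1))
    symbols = np.searchsorted(edges, series, side='right') - 1
    return np.clip(symbols, 0, n_symbols - 1)

# Synthetic streamflow-like record, rounded to three decimal places
rng = np.random.default_rng(42)
flow = np.round(rng.lognormal(mean=0.0, sigma=1.0, size=5000), 3)
coarse = quantize(flow, bits=4)  # 16-symbol dictionary
fine = quantize(flow, bits=8)    # 256-symbol dictionary
```

A coarser dictionary collapses many distinct flow values into a single symbol, which is the quantization noise referred to above.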
As the results show, priors representing the addition of \(10^1\) to \(10^2\) pseudo-counts diminish the performance of the gradient-boosted decision tree model regardless of the dictionary size, i.e. the number of possible values provided by the quantization. Heavily penalizing unpredicted states does not have as great an impact as anticipated, perhaps because the corresponding \(p_i\) values are also small.